Abstract

In this work, the authors investigate whether education finances and household income affect the unemployment rate. The results did not indicate any strong relationship, which is consistent with previous findings.

Exploratory Data Analysis

Starting from three different data sets, a new data frame is created for this study, based on a selection from the previous data sets. Median income in current dollars is taken from the Mean and Median data set. From the unemployment data set, the unemployment rate by state is included. From the U.S. Educational Finances data set, the variables kept for this study are the following:

Additionally, the data set is normalized on the values of State and Year.
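The merge itself is not shown in the report; a minimal sketch of how the three sources could be joined on state and year is below. The column names and the key-normalization helper are illustrative assumptions, not the actual names in the raw files.

```r
# Hypothetical sketch of the merge: each source is assumed to carry a state
# and a year column; the key values are normalized before joining so that
# differences in case and whitespace do not drop rows.
library(dplyr)

income    <- data.frame(State = "Alabama",  Year = 2010, Median.Income = 40.5)
unemp     <- data.frame(State = " alabama", Year = 2010, Unemployment.Rate = 9.5)
education <- data.frame(State = "ALABAMA",  Year = 2010, TOTAL_REVENUE = 6985000)

normalize_keys <- function(df) {
  df$State <- tolower(trimws(df$State))  # normalize the state key
  df
}

merged <- normalize_keys(income) %>%
  inner_join(normalize_keys(unemp),     by = c("State", "Year")) %>%
  inner_join(normalize_keys(education), by = c("State", "Year"))
```

Without the key normalization, the differing spellings of "Alabama" above would produce an empty join.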

After cleaning the dataset, we proceed to explore it visually.

  1. First of all, to get a grasp of how unemployment affected the US in the last two decades, a couple of simple representations are plotted.

Analyzing the plot, it is clear that there is some correlation between the two variables. Median income has been increasing for the last two decades, while unemployment suffers spikes that can be attributed to periods of economic disarray; for example, the impact of the 2008 economic crisis can be seen in the 2010 spike. There is a slight decrease in median household income during the same period. Overall, when income improves, the unemployment rate tends to decrease.

Similar to the previous segment, the unemployment rate increases the most when the capital spent on education decreases. This does not imply causation: employment suffers the most when the economy is suffering, and the same goes for the amount of money spent on education. The decrease seen around 2010 may simply be due to governments having less capital to spend on education.

  2. Calculate the average of Unemployment Rate by state

This plot ranks the states from the highest to the lowest unemployment rate, obtained by averaging this variable throughout the years.
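The averaging step is not shown in code; a minimal dplyr sketch of the per-state mean and ranking, using a toy data frame in place of the real one, would be:

```r
# Sketch: average the unemployment rate per state across all years and order
# the states from highest to lowest average (toy data, illustrative only).
library(dplyr)

finances_toy <- data.frame(
  STATE = c("A", "A", "B", "B"),
  Unemployment.Rate = c(6, 8, 3, 5)
)

avg_by_state <- finances_toy %>%
  group_by(STATE) %>%
  summarise(mean_unemployment = mean(Unemployment.Rate)) %>%
  arrange(desc(mean_unemployment))

avg_by_state  # state "A" ranks first with a mean of 7
```

The same pattern, swapping in Median.Income, produces the income ranking described in the next item.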

  3. Calculate the average of Median Income by state

Similar to the previous one, this plot is a ranking of the states. In this case they are ranked based on the average median household income in the last two decades.

  4. Animating a plot through time to represent Median Income and Unemployment Rate

Each dot represents a state. The size of each dot is proportional to the total capital revenue of the state, the biggest being California. This animation follows the progress of the unemployment rate and median household income, and it roughly matches the patterns in the first graph of this analysis: most of the states move forward, decreasing in unemployment rate while increasing in median household income, until 2010, when the unemployment rate spikes and the income stops moving forward.
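The animation code is not included in the report; a minimal gganimate sketch, assuming the merged data frame carries STATE, YEAR, Median.Income, Unemployment.Rate and TOTAL_REVENUE columns, could look like this:

```r
# Sketch of the animated bubble chart: one dot per state, sized by total
# revenue, animated over the years (toy data, column names assumed).
library(ggplot2)
library(gganimate)

toy <- data.frame(
  STATE = rep(c("A", "B"), each = 2),
  YEAR = rep(c(2000, 2001), 2),
  Unemployment.Rate = c(6, 5, 4, 7),
  Median.Income = c(40, 42, 50, 49),
  TOTAL_REVENUE = c(1e6, 1.1e6, 5e6, 5.2e6)
)

p <- ggplot(toy, aes(x = Unemployment.Rate, y = Median.Income,
                     size = TOTAL_REVENUE, colour = STATE)) +
  geom_point(alpha = 0.7, show.legend = FALSE) +
  transition_time(YEAR) +
  labs(title = "Year: {frame_time}")

# animate(p)  # uncomment to render the frames
```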

  5. Animating a plot through time to represent the total capital spent in education and Unemployment Rate

Each dot represents a state. The size of each dot is proportional to the total capital revenue of the state, the biggest being California. This animation follows the progress of the unemployment rate and the total capital spent in education. It roughly matches the patterns of the previous animation: most of the states decrease in unemployment rate until 2010, when the unemployment rate spikes and education spending stalls.

Methods

Strength of relationships

finances <- read.csv("./data/merged_dataset.csv", header = TRUE, sep = ";", dec = ",", stringsAsFactors = FALSE)
finances$STATE <- as.factor(finances$STATE)

Before analysing the strength of the relationships between the variables, a new variable is added to the dataset: Profit. Profit is the difference between Total Revenue and Total Expenditure, so it is the value that gives us information about the economic resources available in a state for education.

finances$Profit<-finances$TOTAL_REVENUE-finances$TOTAL_EXPENDITURE

Since the question of interest is “How do education finances and household income affect the unemployment rate?”, the response variable is the Unemployment Rate and the independent variables are Household Income and the new variable Profit. We suspect that these variables might affect the unemployment rate, so we checked their relationship with the response variable.

library(ggplot2)
ggplot(data = finances) +
  aes(x = Median.Income, y = Unemployment.Rate)+
  geom_point()

ggplot(data = finances) +
  aes(x = Unemployment.Rate , y = Profit)+
  geom_point()+geom_smooth(method='lm')

Looking at the scatter plot of Unemployment Rate and Median Income, it is possible to notice no relation between them, since the data do not follow any pattern. Conversely, in the scatter plot between Unemployment Rate and the new variable Profit there is a pattern that seems linear. The correlation matrix confirms this behaviour, so Median Income is dropped.

library(dplyr)  # select() comes from dplyr
numeric_subset <- select(finances, Profit, Median.Income, Unemployment.Rate)
M<-cor(numeric_subset)
M
##                        Profit Median.Income Unemployment.Rate
## Profit             1.00000000   0.051900774      -0.054036741
## Median.Income      0.05190077   1.000000000       0.005814948
## Unemployment.Rate -0.05403674   0.005814948       1.000000000

The question of interest is “How do education finances and household income affect the unemployment rate?”, but since the linear regression line on the scatter plot seems affected by the choice of the Unemployment Rate as response variable and Household Income as independent variable, the Y becomes Profit and the X becomes Unemployment Rate.

The linear regression is performed:

Profit = Beta0 + Beta1 * Unemployment Rate

mod<-lm(finances$Profit ~ finances$Unemployment.Rate)
summary(mod)
## 
## Call:
## lm(formula = finances$Profit ~ finances$Unemployment.Rate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -5351199   -72548    71557   162850  3998879 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                  -10628      51055  -0.208   0.8351  
## finances$Unemployment.Rate   -16568       8581  -1.931   0.0537 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 573700 on 1273 degrees of freedom
## Multiple R-squared:  0.00292,    Adjusted R-squared:  0.002137 
## F-statistic: 3.728 on 1 and 1273 DF,  p-value: 0.05373

The intercept is not statistically significant, while the coefficient beta1 is borderline: its p-value (0.054) is below the 10% level but above the usual 5% level, so there is only weak evidence that the unemployment rate affects Profit. The F test gives the same borderline p-value.

The R-squared is quite low, which means this is not a good model. To check that, the next step is the validation of the assumptions the regression model is based on: normality of the residuals, homoskedasticity of the residuals, and independence of the residuals.

library(tseries)
library(lmtest)
jarque.bera.test(mod$residuals) 
## 
##  Jarque Bera Test
## 
## data:  mod$residuals
## X-squared = 25937, df = 2, p-value < 2.2e-16
bptest(mod)
## 
##  studentized Breusch-Pagan test
## 
## data:  mod
## BP = 17.612, df = 1, p-value = 2.708e-05
dwtest(mod)
## 
##  Durbin-Watson test
## 
## data:  mod
## DW = 1.9381, p-value = 0.1315
## alternative hypothesis: true autocorrelation is greater than 0

The only assumption satisfied is the independence of the residuals, since the Durbin-Watson test is the only one with a p-value higher than the alpha level; the Jarque-Bera test rejects normality and the Breusch-Pagan test rejects homoskedasticity. This model is not a good model.

The next step is scaling the values, since they have very different ranges, and performing the regression again:

finances$new_profit=scale(finances$Profit)
finances$new_rate=scale(finances$Unemployment.Rate)
mod2<-lm(finances$new_profit ~ finances$new_rate)
summary(mod2)
## 
## Call:
## lm(formula = finances$new_profit ~ finances$new_rate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3172 -0.1263  0.1246  0.2835  6.9626 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)  
## (Intercept)        1.065e-17  2.798e-02   0.000   1.0000  
## finances$new_rate -5.404e-02  2.799e-02  -1.931   0.0537 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9989 on 1273 degrees of freedom
## Multiple R-squared:  0.00292,    Adjusted R-squared:  0.002137 
## F-statistic: 3.728 on 1 and 1273 DF,  p-value: 0.05373
#residuals test
jarque.bera.test(mod2$residuals) 
## 
##  Jarque Bera Test
## 
## data:  mod2$residuals
## X-squared = 25937, df = 2, p-value < 2.2e-16
bptest(mod2)
## 
##  studentized Breusch-Pagan test
## 
## data:  mod2
## BP = 17.612, df = 1, p-value = 2.708e-05
dwtest(mod2)
## 
##  Durbin-Watson test
## 
## data:  mod2
## DW = 1.9381, p-value = 0.1315
## alternative hypothesis: true autocorrelation is greater than 0

In this work the variable Unemployment.Rate has been chosen as the variable of interest, in order to predict whether a state is going to have unemployment or not depending on its income. To do this, the variable is first transformed into two classes, “True” and “False”: True represents an unemployment rate of at least 4.3%, while False represents an unemployment rate below 4.3%. This threshold was determined after a preliminary analysis, in which 4.3% appears as the typical (first-quartile) value, as can be seen in the figure below.

# Libraries
library(ggplot2)
library(caret)
library(stats)
library(dplyr)
library(plotly)
library(hrbrthemes)
library(rgl)
library(lattice)

finances <- read.csv("./data/merged_dataset.csv",head=TRUE, sep=";",dec=",")

sum(is.na(finances))
## [1] 102
us_finances <- na.omit(finances)
sum(is.na(us_finances))
## [1] 0
unemployment_rate <- us_finances$Unemployment.Rate
us_finances <- mutate(us_finances, unemployment_rate)
summary(us_finances)
##         STATE           YEAR          ENROLL        TOTAL_REVENUE     
##  Alabama   :  24   Min.   :1993   Min.   :  43866   Min.   :  465650  
##  Alaska    :  24   1st Qu.:1999   1st Qu.: 264514   1st Qu.: 2224650  
##  Arizona   :  24   Median :2004   Median : 649934   Median : 5256748  
##  Arkansas  :  24   Mean   :2004   Mean   : 917542   Mean   : 9290765  
##  California:  24   3rd Qu.:2010   3rd Qu.:1010532   3rd Qu.:11109870  
##  Colorado  :  24   Max.   :2016   Max.   :6307022   Max.   :89217262  
##  (Other)   :1080                                                      
##  FEDERAL_REVENUE   STATE_REVENUE      LOCAL_REVENUE      TOTAL_EXPENDITURE 
##  Min.   :  33672   Min.   :       0   Min.   :   23917   Min.   :  481665  
##  1st Qu.: 193102   1st Qu.: 1191590   1st Qu.:  738018   1st Qu.: 2211494  
##  Median : 421910   Median : 2614030   Median : 2098524   Median : 5415694  
##  Mean   : 787394   Mean   : 4312719   Mean   : 4190651   Mean   : 9395936  
##  3rd Qu.: 847927   3rd Qu.: 5224320   3rd Qu.: 4793464   3rd Qu.:10892861  
##  Max.   :9990221   Max.   :50904567   Max.   :36105265   Max.   :85320133  
##                                                                            
##  INSTRUCTION_EXPENDITURE SUPPORT_SERVICES_EXPENDITURE OTHER_EXPENDITURE
##  Min.   :  265549        Min.   :  139963             Min.   :  11541  
##  1st Qu.: 1195616        1st Qu.:  652780             1st Qu.: 103449  
##  Median : 2737071        Median : 1567025             Median : 271704  
##  Mean   : 4864428        Mean   : 2737271             Mean   : 429951  
##  3rd Qu.: 5648436        3rd Qu.: 3308660             3rd Qu.: 517222  
##  Max.   :43964520        Max.   :26058021             Max.   :3995951  
##                                                                        
##  CAPITAL_OUTLAY_EXPENDITURE Median.Income   Unemployment.Rate unemployment_rate
##  Min.   :   12708           Min.   :22.19   Min.   : 2.300    Min.   : 2.300   
##  1st Qu.:  188692           1st Qu.:37.28   1st Qu.: 4.300    1st Qu.: 4.300   
##  Median :  529592           Median :44.39   Median : 5.300    Median : 5.300   
##  Mean   :  924237           Mean   :45.27   Mean   : 5.597    Mean   : 5.597   
##  3rd Qu.:  990893           3rd Qu.:51.84   3rd Qu.: 6.600    3rd Qu.: 6.600   
##  Max.   :10223657           Max.   :76.26   Max.   :13.700    Max.   :13.700   
## 
# plot
us_finances %>% 
  ggplot( aes(x= YEAR, y= unemployment_rate)) +
  geom_line(color="#69b3a2") +
  ylim(0,15) +
  geom_hline(yintercept=4.3, color="orange", size=.5) +
  theme_ipsum()

Three supervised classification algorithms have been used to predict the unemployment rate. The first was the K-Nearest Neighbors algorithm, followed by the Logistic Regression model and finally, Classification Tree.

In order to apply these algorithms, the data first go through a training/testing split. In our case we used 80% of the data for the training phase and 20% for testing. The createDataPartition function from caret is used to create the split.

trainIndex <- createDataPartition(us_finances$unemployment,
                                  p = .8,
                                  list = FALSE,
                                  times = 1
)

Cross-validation is a statistical method used to estimate the skill of machine learning models. To carry out cross-validation, trainControl(method = "cv") is defined together with the number of folds. Afterwards, the same training and testing process as before is carried out again.

fitControl <- trainControl(
  method = "cv",
  number = 10
)

Before performing the training and testing process, a cleaning step was carried out, in which the missing values were removed from the database so that the prediction is much cleaner.

sum(is.na(finances))
## [1] 102
us_finances <- na.omit(finances)

Classification algorithms require a categorical target variable, so at the beginning of the implementation the variable of interest was transformed into categorical values, leaving the other variables as int or num.

unemployment <- ifelse( us_finances$Unemployment.Rate >= 4.3, "True", "False")
us_finances <- mutate(us_finances, unemployment)

In order to compare which model is best, many statistical measures can be used, such as benchmarking, but in this work we used the confusion matrix, which shows the True Negatives, True Positives, False Negatives and False Positives of each classification method. In this way the accuracy of a model can be obtained and compared with those of the other models. The Kappa statistic is also a good measure of the reliability of a model.

As indicated at the beginning of this section, three algorithms have been applied to predict the variable of interest: K-Nearest Neighbors, Logistic Regression and Classification Trees. Below you can see how each of the algorithms was implemented.

KNN

fit_cv_grid <- caret::train(
  unemployment ~ .,
  data = training_set,
  method = "knn",
  trControl = fitControl,
  tuneGrid = grid
)

Logistic Regression

library(mlr)  # getParamSet() and makeLearner() come from mlr
getParamSet("classif.logreg")
##          Type len  Def Constr Req Tunable Trafo
## model logical   - TRUE      -   -   FALSE     -
learner_log <- makeLearner("classif.logreg",
                           predict.type = "response")

Classification Tree

getParamSet("classif.rpart")
##                    Type len  Def   Constr Req Tunable Trafo
## minsplit        integer   -   20 1 to Inf   -    TRUE     -
## minbucket       integer   -    - 1 to Inf   -    TRUE     -
## cp              numeric   - 0.01   0 to 1   -    TRUE     -
## maxcompete      integer   -    4 0 to Inf   -    TRUE     -
## maxsurrogate    integer   -    5 0 to Inf   -    TRUE     -
## usesurrogate   discrete   -    2    0,1,2   -    TRUE     -
## surrogatestyle discrete   -    0      0,1   -    TRUE     -
## maxdepth        integer   -   30  1 to 30   -    TRUE     -
## xval            integer   -   10 0 to Inf   -   FALSE     -
## parms           untyped   -    -        -   -    TRUE     -
learner_bctree <- makeLearner("classif.rpart", 
                              predict.type = "response")

Results

Even with the scaling, the regression has the same problems as the first one. Since the scatter plot seems to show a linear relationship, the next attempt to improve the model was detecting outliers, but we were not successful in doing so.

As indicated at the beginning of the project, the objective is to predict what the unemployment rate will be depending on the average income of each of the states. Three algorithms have been implemented for this purpose. The figure below shows the correlation between average income and the variable of interest.

ggplot(data = us_finances) +
  geom_histogram(
    mapping = aes(x = Median.Income, fill = unemployment),
    alpha = .7,
    position = "identity"
  )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Before starting the analysis, it is important to clean and prepare the data.

# Define the variable unemployment as True or False. We are going to focus on the 1st Qu.

unemployment <- ifelse( us_finances$unemployment_rate >= 4.3, "True", "False")

us_finances <- mutate(us_finances, unemployment)

# MACHINE LEARNING

task <- makeClassifTask(id = "US Finances", data = us_finances, 
                        target = "unemployment", positive = "True")
task # inspect the task object
## Supervised task: US Finances
## Type: classif
## Target: unemployment
## Observations: 1224
## Features:
##    numerics     factors     ordered functionals 
##          13           1           0           0 
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
## Classes: 2
## False  True 
##   287   937 
## Positive class: True

KNN

unemployment_rate <- as.numeric(us_finances$Unemployment.Rate)
us_finances <- mutate(us_finances, unemployment_rate)
ggplot(data = us_finances) +
  geom_histogram(
    mapping = aes(x = Median.Income, fill = unemployment),
    alpha = .7,
    position = "identity"
  )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Training
trainIndex <- createDataPartition(us_finances$unemployment,
                                  p = .8,
                                  list = FALSE,
                                  times = 1
)

training_set <- us_finances[ trainIndex, ]
test_set <- us_finances[ -trainIndex, ]

basic_fit <- caret::train(unemployment ~ ., data = training_set, method = "knn")

basic_preds <- predict(basic_fit, test_set)

fitControl <- trainControl(
  method = "cv",
  number = 10
)

fit_with_cv <- caret::train(
  unemployment ~ .,
  data = training_set,
  method = "knn",
  trControl = fitControl
)


fit_cv_preds <- predict(fit_with_cv, test_set)

unemployment_factor <- as.factor(test_set$unemployment)

confusionMatrix(unemployment_factor, fit_cv_preds, positive = "True")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction False True
##      False    20   37
##      True      9  178
##                                           
##                Accuracy : 0.8115          
##                  95% CI : (0.7567, 0.8585)
##     No Information Rate : 0.8811          
##     P-Value [Acc > NIR] : 0.9994          
##                                           
##                   Kappa : 0.3651          
##                                           
##  Mcnemar's Test P-Value : 6.865e-05       
##                                           
##             Sensitivity : 0.8279          
##             Specificity : 0.6897          
##          Pos Pred Value : 0.9519          
##          Neg Pred Value : 0.3509          
##              Prevalence : 0.8811          
##          Detection Rate : 0.7295          
##    Detection Prevalence : 0.7664          
##       Balanced Accuracy : 0.7588          
##                                           
##        'Positive' Class : True            
## 
grid <- expand.grid(k = 1:20)

fit_cv_grid <- caret::train(
  unemployment ~ .,
  data = training_set,
  method = "knn",
  trControl = fitControl,
  tuneGrid = grid
)

preds_cv_grid <- predict(fit_cv_grid, test_set)

confusionMatrix(unemployment_factor, preds_cv_grid, positive = "True")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction False True
##      False    27   30
##      True     10  177
##                                           
##                Accuracy : 0.8361          
##                  95% CI : (0.7835, 0.8802)
##     No Information Rate : 0.8484          
##     P-Value [Acc > NIR] : 0.738027        
##                                           
##                   Kappa : 0.4786          
##                                           
##  Mcnemar's Test P-Value : 0.002663        
##                                           
##             Sensitivity : 0.8551          
##             Specificity : 0.7297          
##          Pos Pred Value : 0.9465          
##          Neg Pred Value : 0.4737          
##              Prevalence : 0.8484          
##          Detection Rate : 0.7254          
##    Detection Prevalence : 0.7664          
##       Balanced Accuracy : 0.7924          
##                                           
##        'Positive' Class : True            
## 

Classification Tree

ParamHelpers::getParamSet("classif.rpart")
##                    Type len  Def   Constr Req Tunable Trafo
## minsplit        integer   -   20 1 to Inf   -    TRUE     -
## minbucket       integer   -    - 1 to Inf   -    TRUE     -
## cp              numeric   - 0.01   0 to 1   -    TRUE     -
## maxcompete      integer   -    4 0 to Inf   -    TRUE     -
## maxsurrogate    integer   -    5 0 to Inf   -    TRUE     -
## usesurrogate   discrete   -    2    0,1,2   -    TRUE     -
## surrogatestyle discrete   -    0      0,1   -    TRUE     -
## maxdepth        integer   -   30  1 to 30   -    TRUE     -
## xval            integer   -   10 0 to Inf   -   FALSE     -
## parms           untyped   -    -        -   -    TRUE     -
learner_bctree <- mlr::makeLearner("classif.rpart", 
                              predict.type = "response")
learner_bctree$par.set #same as getparamset
##                    Type len  Def   Constr Req Tunable Trafo
## minsplit        integer   -   20 1 to Inf   -    TRUE     -
## minbucket       integer   -    - 1 to Inf   -    TRUE     -
## cp              numeric   - 0.01   0 to 1   -    TRUE     -
## maxcompete      integer   -    4 0 to Inf   -    TRUE     -
## maxsurrogate    integer   -    5 0 to Inf   -    TRUE     -
## usesurrogate   discrete   -    2    0,1,2   -    TRUE     -
## surrogatestyle discrete   -    0      0,1   -    TRUE     -
## maxdepth        integer   -   30  1 to 30   -    TRUE     -
## xval            integer   -   10 0 to Inf   -   FALSE     -
## parms           untyped   -    -        -   -    TRUE     -
mod_bctree <- mlr::train(learner_bctree, task)
getLearnerModel(mod_bctree)
## n= 1224 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 1224 287 True (0.2344771 0.7655229)  
##   2) Unemployment.Rate< 4.25 287   0 False (1.0000000 0.0000000) *
##   3) Unemployment.Rate>=4.25 937   0 True (0.0000000 1.0000000) *
predict_bctree <- predict(mod_bctree, task = task)
head(as.data.frame(predict_bctree))
conf_matrix_bctree <- calculateConfusionMatrix(predict_bctree)
conf_matrix_bctree
##         predicted
## true     False True -err.-
##   False    287    0      0
##   True       0  937      0
##   -err.-     0    0      0

Only the classification tree reaches an accuracy of 100%, and this is largely because the continuous Unemployment.Rate variable, from which the target classes were derived, is still among the predictors: as the fitted tree shows, a single split at 4.25 separates the classes perfectly. The confusion matrix makes this explicit. In the upper-left cell are the true negatives: 287 of the 1224 observations are labelled False, i.e. 23.45% of the labels are False, and the model classifies all of them correctly. In the lower-right cell are the true positives: 937 labels are True and the model classified them as True. The upper-right cell would hold the false positives (observations classified as True while actually False) and the lower-left cell the false negatives; in this case both are 0.


Discussion and Future Work

References

  1. https://learningenglish.voanews.com/a/us-census-bureau-americans-are-more-educated-than-ever-before/4546489.html

  2. Psacharopoulos, George. (2006). The Value of Investment in Education: Theory, Evidence and Policy. http://lst-iiep.iiep-unesco.org/cgi-bin/wwwi32.exe/[in=epidoc1.in]/?t2000=024211/(100). 32.

  3. Evans, William & Murray, Sheila & Schwab, Robert. (1998). Education-Finance Reform and the Distribution of Education Resources. American Economic Review. 88. 789-812.

  4. Alan L. Montgomery, Victor Zarnowitz, Ruey S. Tsay & George C. Tiao (1998) Forecasting the U.S. Unemployment Rate, Journal of the American Statistical Association, 93:442, 478-493, DOI: 10.1080/01621459.1998.10473696

  5. https://www.pgpf.org/blog/2019/10/income-and-wealth-in-the-united-states-an-overview-of-data

Authors